How Do You Filter DataFrame Data with Python pandas? [Python for Excel #8]

2024 iThome Ironman Contest

DAY 8

Python

Learn to Process Excel with Python pandas and openpyxl in 30 Days: Becoming a Pro at Handling Excel Files with Python, Part 8 of the series


april

2024-09-21 08:29:53


This article is also published as: How Do You Filter DataFrame Data with Python pandas? [Python for Excel #8]

Introduction

This article explains how to filter DataFrame data with Python pandas.

Sample data

This article uses example.xlsx as its sample data. The file contains the following:

order_id	product_name	unit_price	ship_date
10000	S7000	700	2024/9/1
10001	A5000	500	2024/12/6
10002	A3000	300	2024/9/13
10003	T-APP	3000	2024/10/8

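Filtering by condition

pandas can filter a DataFrame with Boolean values (True or False): a comparison on a column produces one True or False per row, and indexing the DataFrame with that result keeps only the True rows. For example, to select products whose unit_price is greater than 500: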

import pandas as pd

# Read the data from the Excel file
df = pd.read_excel('example.xlsx')

# Keep the products whose unit_price is greater than 500
filtered_df = df[df['unit_price'] > 500]

print(filtered_df)

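The code above returns every row whose unit_price is greater than 500. With the sample data, that is S7000 and T-APP; A5000 is excluded because its price is exactly 500.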
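To make the mechanism visible, here is a minimal sketch that prints the Boolean mask before using it, and then combines two conditions with `&`. The sample data is built inline (an assumption, so the sketch runs without example.xlsx), and the names `mask` and `mid_range` are illustrative only.

```python
import pandas as pd

# Sample data mirroring example.xlsx, built inline so the sketch is self-contained
df = pd.DataFrame({
    'order_id': [10000, 10001, 10002, 10003],
    'product_name': ['S7000', 'A5000', 'A3000', 'T-APP'],
    'unit_price': [700, 500, 300, 3000],
})

# The comparison yields a Boolean Series: one True/False per row
mask = df['unit_price'] > 500
print(mask.tolist())  # [True, False, False, True]

# Indexing with the mask keeps only the rows marked True
filtered_df = df[mask]
print(filtered_df['product_name'].tolist())  # ['S7000', 'T-APP']

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mid_range = df[(df['unit_price'] >= 300) & (df['unit_price'] <= 700)]
print(mid_range['product_name'].tolist())  # ['S7000', 'A5000', 'A3000']
```

The parentheses matter because `&` binds more tightly than comparison operators in Python.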

Filtering with string methods

str.startswith() selects values that begin with a given prefix. For example, to select every row whose product_name starts with "A":

import pandas as pd

# Read the data from the Excel file
df = pd.read_excel('example.xlsx')

# Keep the rows whose product_name starts with 'A'
filtered_df = df[df['product_name'].str.startswith('A')]

print(filtered_df)
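Real spreadsheets often have empty cells, and a missing product_name can leave a gap in the mask. The sketch below assumes one such row (order 10004 is hypothetical and not part of example.xlsx) and passes `na=False` so missing names simply count as "no match". It also shows `str.endswith()`, which works the same way for suffixes.

```python
import pandas as pd

# Inline sample data; the last row has a missing product_name (a hypothetical case)
df = pd.DataFrame({
    'order_id': [10000, 10001, 10002, 10003, 10004],
    'product_name': ['S7000', 'A5000', 'A3000', 'T-APP', None],
})

# na=False treats missing names as "no match", so the mask is purely Boolean
starts_with_a = df[df['product_name'].str.startswith('A', na=False)]
print(starts_with_a['product_name'].tolist())  # ['A5000', 'A3000']

# str.endswith() does the same for suffixes
ends_with_000 = df[df['product_name'].str.endswith('000', na=False)]
print(ends_with_000['product_name'].tolist())  # ['S7000', 'A5000', 'A3000']
```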

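Filtering with regular expressions (regex)

A regular expression (regex or regexp for short) is a pattern used to search for and match strings of a particular form. Many programming languages support regular expressions, including JavaScript, Python, Java, C#, and Perl.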

Example 1

Suppose you want every row in the DataFrame whose product_name contains the digits "3000":

# Use a regular expression to keep rows whose product_name contains "3000"
filtered_df = df[df['product_name'].str.contains(r'3000')]
print(filtered_df)

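Explanation

str.contains(r'3000'): str.contains() checks whether each string contains the given pattern. Here r'3000' is a regular expression describing the text to look for.
The output shows the row whose product_name is "A3000". T-APP is not matched, because 3000 appears only in its unit_price.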

Example 2

To keep every row whose product_name contains a digit:

# Use a regular expression to keep rows whose product_name contains a digit
filtered_df = df[df['product_name'].str.contains(r'\d')]
print(filtered_df)

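Explanation

r'\d': a regular expression in which \d stands for any digit character (for ASCII text, the same as [0-9]), so the pattern matches any string that contains a digit.
str.contains(r'\d'): checks whether each product_name contains at least one digit.
The output shows the rows whose product_name is S7000, A5000, or A3000.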
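The two examples match a pattern anywhere in the string. As a sketch of a few common variations, the code below anchors a pattern to the whole string, inverts a mask with `~`, and turns regex interpretation off with `regex=False`. The sample data is built inline, and the variable names are illustrative only.

```python
import pandas as pd

# Product names from the sample data, built inline so the sketch is self-contained
df = pd.DataFrame({
    'product_name': ['S7000', 'A5000', 'A3000', 'T-APP'],
})

# ^ and $ anchor the pattern to the whole string:
# one uppercase letter followed by exactly four digits
letter_plus_digits = df[df['product_name'].str.contains(r'^[A-Z]\d{4}$')]
print(letter_plus_digits['product_name'].tolist())  # ['S7000', 'A5000', 'A3000']

# ~ inverts the mask: names that contain no digit at all
no_digits = df[~df['product_name'].str.contains(r'\d')]
print(no_digits['product_name'].tolist())  # ['T-APP']

# regex=False treats the pattern as plain text rather than a regular expression
has_hyphen = df[df['product_name'].str.contains('-', regex=False)]
print(has_hyphen['product_name'].tolist())  # ['T-APP']
```

`regex=False` is a reasonable default when the search text comes from user input and may contain characters such as `.` or `+` that have special meaning in a regular expression.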

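How do you filter data for the next three months?

An earlier article covered how to get the date three months from now; this section shows how to filter the rows that fall within the next three months.

Getting the current date

First, get the current date and compute the end of the three-month range: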

from datetime import datetime, timedelta

# Get the current date and time (the start of the range)
current_date = datetime.now()
# Compute the end of the range; 90 days is an approximation of three months
three_months_later = current_date + timedelta(days=90)
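A 90-day window is close to three months but not identical, since months differ in length. If calendar months matter, one option is `pd.DateOffset(months=3)`. The sketch below compares the two, using a fixed reference date (an assumption, chosen only so the result is reproducible) instead of `datetime.now()`.

```python
from datetime import datetime, timedelta

import pandas as pd

# A fixed reference date (hypothetical) so the result is reproducible
current_date = datetime(2024, 9, 21)

# Fixed 90-day window
plus_90_days = current_date + timedelta(days=90)
print(plus_90_days.date())  # 2024-12-20

# Calendar-aware offset: the same day of the month, three months later
plus_3_months = pd.Timestamp(current_date) + pd.DateOffset(months=3)
print(plus_3_months.date())  # 2024-12-21
```

Either value can serve as the end of the range; which one is appropriate depends on whether "three months" means a fixed number of days or calendar months.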

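Filtering the data

Next, use these start and end dates to select the rows within the next three months:

# Keep the rows whose ship_date falls within the next three months
future_data = df[(df['ship_date'] >= current_date) & (df['ship_date'] <= three_months_later)]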

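This code selects every row whose ship_date falls within the next three months, combining two conditions with Boolean logic (&). It assumes ship_date holds datetime values, which pd.read_excel usually produces for cells stored as dates in Excel.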
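If the dates arrive as text instead (for example, cells formatted as text in Excel), they need converting before the comparison. The following is a minimal end-to-end sketch under that assumption: it builds the ship_date column inline as strings, converts it with `pd.to_datetime()`, and filters with `between()`. The fixed `start` date is hypothetical and stands in for `datetime.now()` so the output is repeatable.

```python
import pandas as pd

# Inline sample data; ship_date is kept as text here to show the conversion step
df = pd.DataFrame({
    'order_id': [10000, 10001, 10002, 10003],
    'ship_date': ['2024/9/1', '2024/12/6', '2024/9/13', '2024/10/8'],
})

# Convert text to datetime64 so comparisons are chronological, not alphabetical
df['ship_date'] = pd.to_datetime(df['ship_date'], format='%Y/%m/%d')

# Fixed reference date (hypothetical) instead of datetime.now(), for a repeatable result
start = pd.Timestamp('2024-09-21')
end = start + pd.DateOffset(months=3)

# between() is inclusive on both ends by default
future_data = df[df['ship_date'].between(start, end)]
print(future_data['order_id'].tolist())  # [10001, 10003]
```

`between(start, end)` is equivalent to the two comparisons joined with `&`, and reads a little more directly for range checks.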

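Summary

Filtering is a basic pandas feature; you can filter with Boolean values, string methods, regular expressions, and more.
Boolean values can serve as the index or condition of a filter, selecting the rows or columns that meet a given condition.
String methods such as str.startswith() quickly select data that matches a simple text condition.
Regular expressions are a practical tool for handling and filtering more complex strings.
Combined with datetime and timedelta from the standard datetime module, pandas can filter data within a specific date range.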
